Here, we wanted to map CSL areas of interest onto the Human Phenotype Ontology (HPO) and/or other disease ontologies (OMIM/Orphanet). This allowed us to connect our phenotype-cell type association results to the CSL areas of interest. For a description of our methodology, see:
Identification of cell type-specific gene targets underlying thousands of rare diseases and subtraits. Kitty B. Murphy, Robert Gordon-Smith, Jai Chapman, Momoko Otani, Brian M. Schilder, Nathan G. Skene. https://www.medrxiv.org/content/10.1101/2023.02.13.23285820v1
First, we:
Gathered all diseases/phenotypes that CSL lists on their website and/or the Research Accelerator Initiative materials.
For each disease/phenotype, we assigned a Group based on whether CSL has expressed interest in Early Stage Partnering with external labs (e.g. via the RAI) or whether they are already actively pursuing research in these fields.
We then grouped each disease/phenotype into a broader Area (e.g. “Immunology”) and a particular disease Subarea (e.g. “Dermatomyositis”).
Next, we mapped each disease/phenotype onto a standardised HPO ID or OMIM/Orphanet ID.
Next, we took all the HPO IDs and expanded this list to include any HPO terms that are descendants of the original HPO IDs. This allows us to capture any phenotypes that are more specific than the original CSL HPO IDs.
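The expansion step can be sketched in base R as a breadth-first walk over parent-to-child links. The toy edge list and the `get_descendants` helper below are hypothetical illustrations, not the functions used in our analysis, and the IDs are placeholders rather than real HPO terms.

```r
# Toy ontology as a parent -> child edge list (placeholder IDs, not real HPO terms)
edges <- data.frame(
  parent = c("HP:A", "HP:A", "HP:B", "HP:C", "HP:X"),
  child  = c("HP:B", "HP:C", "HP:D", "HP:D", "HP:Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect the seed terms plus every term reachable from them
get_descendants <- function(seeds, edges) {
  visited <- character(0)
  frontier <- unique(seeds)
  while (length(frontier) > 0) {
    visited <- union(visited, frontier)
    children <- unique(edges$child[edges$parent %in% frontier])
    # Only follow terms not seen before, so shared children are visited once
    frontier <- setdiff(children, visited)
  }
  sort(visited)
}

expanded <- get_descendants("HP:A", edges)
print(expanded)
```

Because HPO terms can have several parents, the `setdiff()` step matters: it stops a term such as `HP:D` from being processed twice when it is reached via two different branches.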
Next, we imported the results of our phenotype-cell type association analysis. This analysis was performed using the MSTExplorer package and the results are stored in a data.table.
We developed a pipeline to filter our results down to the most promising gene targets for the most severe phenotypes. It includes a number of steps, including filtering on cell type-specific gene expression and on the evidence supporting the causal relationship between each gene and a phenotype.
Ultimately, it allows us to trace multi-scale disease mechanisms from genes → cell types → phenotypes → diseases.
Below we show a preview of the top 100 of these prioritised targets:
To determine whether the genes prioritised by our therapeutic targets pipeline were plausible, we checked what percentage of existing gene therapy targets we recapitulated. Data on therapeutic approval status was gathered from the Therapeutic Target Database (TTD; release 2024-03-21). Overall, we prioritised 79% of all non-failed existing gene therapy targets. A hypergeometric test confirmed that our prioritised targets were significantly enriched for non-failed gene therapy targets (\(p=0.0104\)). Importantly, we did not prioritise any of the failed therapeutics (0%), defined as those that were terminated or withdrawn from the market. The hypergeometric test for depletion of failed targets did not reach significance (\(p=0.365\)), but this is to be expected as there was only one failed gene therapy target in the TTD.
Even when considering therapeutics of any kind, not just gene therapies, we recapitulated 44% of the non-failed therapeutic targets and 0% of the terminated/withdrawn therapeutic targets (n=1255). Here we found that our prioritised targets were significantly enriched for non-failed therapeutics (\(p=3\times10^{-19}\)) and significantly depleted for failed therapeutics (\(p=3\times10^{-199}\)). This suggests that our multi-scale, evidence-based prioritisation pipeline is capable of selectively identifying genes that are likely to be effective therapeutic targets.
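As a minimal sketch of how such enrichment and depletion tests can be run, the example below uses base R's `phyper()`. All counts are hypothetical placeholders chosen for illustration, not the values from our analysis, and the variable names are our own.

```r
# All counts below are hypothetical placeholders, not the values from our analysis
n_background  <- 5000  # genes in the background set
n_prioritised <- 500   # genes prioritised by the pipeline
n_targets     <- 40    # known therapeutic targets in the background
n_overlap     <- 12    # known targets that were also prioritised

# Enrichment: P(overlap >= observed) when drawing n_prioritised genes at random
p_enrich <- phyper(
  q = n_overlap - 1,
  m = n_targets,
  n = n_background - n_targets,
  k = n_prioritised,
  lower.tail = FALSE
)

# Depletion: P(overlap <= observed) under the same null
p_deplete <- phyper(
  q = n_overlap,
  m = n_targets,
  n = n_background - n_targets,
  k = n_prioritised,
  lower.tail = TRUE
)

print(c(enrichment = p_enrich, depletion = p_deplete))
```

Note the `q = n_overlap - 1` in the enrichment call: with `lower.tail = FALSE`, `phyper()` returns P(X > q), so subtracting one gives the probability of an overlap at least as large as the one observed. When the target set contains a single gene, the smallest achievable depletion p-value is the probability of drawing none of it, `1 - n_prioritised / n_background`, which is why a test on one failed target has very little power.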
Here we summarise the percentage of CSL phenotypes of interest (with HPO IDs) included in our prioritised cell type-specific targets. We also compute the proportion of CSL diseases (with OMIM/Orphanet IDs) that have at least one of those phenotypes as a symptom.
## 1180/1879 (62.8%) CSL phenotypes covered in our phenotype-cell type association results.
## 600/1879 (31.93%) CSL phenotypes (across 4850 diseases) covered in our prioritised targets.
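Coverage figures of this kind come down to a set intersection followed by a percentage. The sketch below shows one way to produce such a summary line in base R; the `report_coverage` helper and the placeholder IDs are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical placeholder IDs; in practice these would be HPO IDs
csl_phenotypes    <- c("HP:A", "HP:B", "HP:C", "HP:D", "HP:E")
result_phenotypes <- c("HP:B", "HP:C", "HP:E", "HP:Z")

# Hypothetical helper: report how many IDs of interest appear in a result set
report_coverage <- function(ids, covered_ids, label) {
  ids <- unique(ids)
  n_total <- length(ids)
  n_covered <- sum(ids %in% covered_ids)
  pct <- round(100 * n_covered / n_total, 2)
  sprintf("%d/%d (%s%%) %s", n_covered, n_total, format(pct), label)
}

msg <- report_coverage(csl_phenotypes, result_phenotypes, "phenotypes covered")
print(msg)
```

The same arithmetic reproduces the percentages reported above: `round(100 * 1180 / 1879, 2)` gives 62.8 and `round(100 * 600 / 1879, 2)` gives 31.93.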
From our list of prioritised cell type-specific gene targets, we can now select the top targets for each CSL area of interest.
We have arbitrarily set a cutoff of 3 targets per area, but this can easily be adjusted as needed.
We can also count the number of targets, genes, phenotypes and diseases per area of interest.
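Selecting the top targets per area and counting genes per area can be sketched in base R as follows. The table, its column names (`area`, `gene`, `score`) and the `top_n_per_area` helper are hypothetical; the real results table has a different structure and ranking criteria.

```r
# Hypothetical table of prioritised targets; names and values are illustrative only
targets <- data.frame(
  area  = c("Area A", "Area A", "Area A", "Area A", "Area B", "Area B"),
  gene  = c("G1", "G2", "G3", "G4", "G5", "G6"),
  score = c(0.9, 0.4, 0.7, 0.8, 0.6, 0.95),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n highest-scoring rows within each area
top_n_per_area <- function(dt, n = 3) {
  # Sort by area, then by descending score
  dt <- dt[order(dt$area, -dt$score), ]
  # Rank rows within each area and keep the first n
  rank_in_area <- ave(seq_len(nrow(dt)), dt$area, FUN = seq_along)
  out <- dt[rank_in_area <= n, ]
  rownames(out) <- NULL
  out
}

top_targets <- top_n_per_area(targets, n = 3)
print(top_targets)

# Number of distinct genes retained per area
genes_per_area <- tapply(top_targets$gene, top_targets$area,
                         function(x) length(unique(x)))
print(genes_per_area)
```

Changing the cutoff is then a matter of passing a different `n`.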
Now that we have the top candidate targets for CSL disease areas, we can explore them in more detail. For example, we can use additional resources (ClinVar, Variant Effect Predictor, gnomAD) to find out which variants are deleterious within (or around) the gene targets. We can also gather population-level data on the frequency of these variants, giving us a better idea of the number of patients that may benefit from treating this particular mechanism.
Let’s first explore phenotypes related to “Stroke”.
As an example, we’re going to look specifically at the role of COL4A1 in stroke.
We gathered additional data from the gnomAD website on the COL4A1 gene (i.e. ENSG00000187498).
Our goal here was to:
Identify confirmed pathogenic variants in COL4A1.
Check the frequency of these variants in the general population.
Identify the exon with the highest frequency of pathogenic variants, since this exon could be a good target for therapeutic intervention via ASO-induced exon-skipping.
Additional info:
Exon 3 spans the following genomic coordinates: chr13:110213926-110214015
Two exons stood out as having the highest cumulative frequencies of pathogenic variants in the COL4A1 gene. Of these two, exon 3 has the highest.
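The per-exon ranking can be sketched in base R as a filter followed by a grouped sum. The variant table below is entirely hypothetical (exon labels, annotations and frequencies are invented for illustration), and summing allele frequencies within an exon is assumed here as one simple reading of "cumulative frequency".

```r
# Hypothetical variant table; labels, annotations and frequencies are illustrative only
variants <- data.frame(
  exon = c("A", "A", "B", "B", "C", "C"),
  clinical_significance = c("Pathogenic", "Pathogenic", "Pathogenic",
                            "Benign", "Pathogenic", "Uncertain significance"),
  allele_frequency = c(2e-5, 1e-5, 1.5e-5, 3e-4, 5e-6, 1e-4),
  stringsAsFactors = FALSE
)

# Keep only variants annotated as pathogenic
pathogenic <- variants[variants$clinical_significance == "Pathogenic", ]

# Sum allele frequencies within each exon, then rank exons from highest to lowest
exon_freq <- aggregate(allele_frequency ~ exon, data = pathogenic, FUN = sum)
exon_freq <- exon_freq[order(-exon_freq$allele_frequency), ]
print(exon_freq)

# Exon with the highest cumulative frequency of pathogenic variants
top_exon <- exon_freq$exon[1]
print(top_exon)
```

Filtering before aggregating matters here: exon `B` carries the most common variant in the toy table, but that variant is benign and so does not contribute to the ranking.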
Follow-up research into the existing literature was carried out (not shown here) to determine whether:
any similar therapies currently exist for this gene
exon 3 could be safely skipped
there are any animal models available for the homologous exons in this gene
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.34.0
## [2] fs_1.6.3
## [3] matrixStats_1.2.0
## [4] bitops_1.0-7
## [5] EnsDb.Hsapiens.v75_2.99.0
## [6] httr_1.4.7
## [7] RColorBrewer_1.1-3
## [8] doParallel_1.0.17
## [9] tools_4.3.1
## [10] backports_1.4.1
## [11] utf8_1.2.4
## [12] R6_2.5.1
## [13] DT_0.32
## [14] lazyeval_0.2.2
## [15] GetoptLong_1.0.5
## [16] withr_3.0.0
## [17] prettyunits_1.2.0
## [18] cli_3.6.2
## [19] Biobase_2.62.0
## [20] labeling_0.4.3
## [21] sass_0.4.8
## [22] readr_2.1.5
## [23] ewceData_1.10.0
## [24] Rsamtools_2.18.0
## [25] yulab.utils_0.1.4
## [26] R.utils_2.12.3
## [27] dichromat_2.0-0.1
## [28] orthogene_1.9.1
## [29] maps_3.4.2
## [30] limma_3.58.1
## [31] readxl_1.4.3
## [32] rstudioapi_0.15.0
## [33] RSQLite_2.3.5
## [34] pals_1.9
## [35] generics_0.1.3
## [36] gridGraphics_0.5-1
## [37] shape_1.4.6.1
## [38] BiocIO_1.12.0
## [39] gtools_3.9.5
## [40] crosstalk_1.2.1
## [41] car_3.1-2
## [42] dplyr_1.1.4
## [43] zip_2.3.1
## [44] homologene_1.4.68.19.3.27
## [45] Matrix_1.6-5
## [46] fansi_1.0.6
## [47] S4Vectors_0.40.2
## [48] abind_1.4-5
## [49] R.methodsS3_1.8.2
## [50] lifecycle_1.0.4
## [51] scatterplot3d_0.3-44
## [52] yaml_2.3.8
## [53] carData_3.0-5
## [54] SummarizedExperiment_1.32.0
## [55] gplots_3.1.3.1
## [56] SparseArray_1.2.4
## [57] BiocFileCache_2.10.1
## [58] grid_4.3.1
## [59] blob_1.2.4
## [60] promises_1.2.1
## [61] ExperimentHub_2.10.0
## [62] crayon_1.5.2
## [63] lattice_0.22-5
## [64] GenomicFeatures_1.54.4
## [65] chromote_0.2.0
## [66] KEGGREST_1.42.0
## [67] mapproj_1.2.11
## [68] pillar_1.9.0
## [69] knitr_1.45
## [70] ComplexHeatmap_2.18.0
## [71] KGExplorer_0.99.0
## [72] GenomicRanges_1.54.1
## [73] rjson_0.2.21
## [74] codetools_0.2-19
## [75] glue_1.7.0
## [76] ggfun_0.1.4
## [77] data.table_1.15.2
## [78] vctrs_0.6.5
## [79] png_0.1-8
## [80] treeio_1.26.0
## [81] cellranger_1.1.0
## [82] gtable_0.3.4
## [83] HPOExplorer_1.0.0
## [84] cachem_1.0.8
## [85] xfun_0.42
## [86] openxlsx_4.2.5.2
## [87] S4Arrays_1.2.1
## [88] mime_0.12
## [89] tidygraph_1.3.1
## [90] SingleCellExperiment_1.24.0
## [91] RNOmni_1.0.1.2
## [92] iterators_1.0.14
## [93] simona_1.0.10
## [94] statmod_1.5.0
## [95] interactiveDisplayBase_1.40.0
## [96] ellipsis_0.3.2
## [97] nlme_3.1-164
## [98] ggtree_3.10.1
## [99] EWCE_1.11.3
## [100] bit64_4.0.5
## [101] progress_1.2.3
## [102] filelock_1.0.3
## [103] GenomeInfoDb_1.38.7
## [104] rprojroot_2.0.4
## [105] bslib_0.6.1
## [106] KernSmooth_2.23-22
## [107] colorspace_2.1-0
## [108] BiocGenerics_0.48.1
## [109] DBI_1.2.2
## [110] tidyselect_1.2.1
## [111] processx_3.8.3
## [112] bit_4.0.5
## [113] compiler_4.3.1
## [114] curl_5.2.1
## [115] rvest_1.0.4
## [116] httr2_1.0.0
## [117] xml2_1.3.6
## [118] DelayedArray_0.28.0
## [119] plotly_4.10.4
## [120] rtracklayer_1.62.0
## [121] scales_1.3.0
## [122] caTools_1.18.2
## [123] rappdirs_0.3.3
## [124] stringr_1.5.1
## [125] digest_0.6.35
## [126] piggyback_0.1.5
## [127] rmarkdown_2.26
## [128] XVector_0.42.0
## [129] htmltools_0.5.7
## [130] pkgconfig_2.0.3
## [131] GeneOverlap_1.38.0
## [132] MatrixGenerics_1.14.0
## [133] echodata_0.99.17
## [134] highr_0.10
## [135] dbplyr_2.4.0
## [136] fastmap_1.1.1
## [137] ensembldb_2.26.0
## [138] rlang_1.1.3
## [139] GlobalOptions_0.1.2
## [140] htmlwidgets_1.6.4
## [141] shiny_1.8.0
## [142] farver_2.1.1
## [143] jquerylib_0.1.4
## [144] jsonlite_1.8.8
## [145] BiocParallel_1.36.0
## [146] R.oo_1.26.0
## [147] RCurl_1.98-1.14
## [148] magrittr_2.0.3
## [149] GenomeInfoDbData_1.2.11
## [150] ggplotify_0.1.2
## [151] patchwork_1.2.0
## [152] munsell_0.5.0
## [153] Rcpp_1.0.12
## [154] ape_5.7-1
## [155] babelgene_22.9
## [156] stringi_1.8.3
## [157] zlibbioc_1.48.0
## [158] AnnotationHub_3.10.0
## [159] plyr_1.8.9
## [160] parallel_4.3.1
## [161] Biostrings_2.70.2
## [162] hms_1.1.3
## [163] circlize_0.4.16
## [164] ps_1.7.6
## [165] igraph_2.0.3
## [166] ggpubr_0.6.0
## [167] ggsignif_0.6.4
## [168] reshape2_1.4.4
## [169] biomaRt_2.58.2
## [170] stats4_4.3.1
## [171] gprofiler2_0.2.3
## [172] BiocVersion_3.18.1
## [173] XML_3.99-0.16.1
## [174] evaluate_0.23
## [175] BiocManager_1.30.22
## [176] tzdb_0.4.0
## [177] foreach_1.5.2
## [178] httpuv_1.6.14
## [179] rols_2.30.2
## [180] grr_0.9.5
## [181] tidyr_1.3.1
## [182] purrr_1.0.2
## [183] clue_0.3-65
## [184] ggplot2_3.5.0
## [185] broom_1.0.5
## [186] xtable_1.8-4
## [187] AnnotationFilter_1.26.0
## [188] restfulr_0.0.15
## [189] tidytree_0.4.6
## [190] rstatix_0.7.2
## [191] later_1.3.2
## [192] viridisLite_0.4.2
## [193] MSTExplorer_1.0.0
## [194] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
## [195] Polychrome_1.5.1
## [196] tibble_3.2.1
## [197] websocket_1.4.1
## [198] aplot_0.2.2
## [199] memoise_2.0.1.9000
## [200] AnnotationDbi_1.64.1
## [201] GenomicAlignments_1.38.2
## [202] IRanges_2.36.0
## [203] cluster_2.1.6
## [204] HGNChelper_0.8.1
## [205] here_1.0.1